Clustering

Unsupervised Learning

So far in this class, we’ve only done supervised learning: we have a response variable, and we observe its value for all (or at least some) of our observations.


Clustering is a type of unsupervised learning: we sort our observations into clusters based on the predictors, but we have no pre-conceived notion of what those clusters represent!

Clustering Usage

Clustering

The general goal of clustering is to make clusters such that points within a cluster are closer to each other than to the points outside the cluster.


What is our definition of close?

How many clusters do we think exist?

What algorithm do we use to select the clusters?

K-means Clustering Algorithm

Idea: Iteratively update the centers of the clusters until convergence.

  1. Plop \(k\) points down in space. These are the centroids.

There are many ways to choose initial centroids!

  2. Determine which centroid each observation is closest to. Assign it to that cluster.

K-means Clustering Algorithm Continued

  3. Find the mean of each cluster. These are the new centroids.
  4. Continue this process (assigning points to clusters and updating the centroids) until the centroids don’t change.
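The steps above can be sketched from scratch in a few lines of base R. This is a simplified illustration on a small made-up two-column matrix (not the Federalist data), with starting centroids chosen naively; in practice you would just call kmeans(), which uses smarter initialization.

```r
# A minimal from-scratch sketch of the k-means loop on made-up data.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 3), ncol = 2))

k <- 2
centroids <- x[c(1, 11), ]   # step 1: pick starting centroids (here: two rows)

for (iter in 1:50) {
  # step 2: assign each observation to its nearest centroid
  d <- sapply(1:k, function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
  clust <- apply(d, 1, which.min)

  # step 3: the mean of each cluster becomes the new centroid
  new_centroids <- t(sapply(1:k, function(j) colMeans(x[clust == j, , drop = FALSE])))

  # step 4: stop once the centroids no longer move
  if (all(abs(new_centroids - centroids) < 1e-12)) break
  centroids <- new_centroids
}

table(clust)   # cluster sizes after convergence
```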

A K-Means Animation


https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

K-means Clustering in R

The kmeans() function must be given a numeric matrix (or data frame) and a number of clusters.

It returns the centroids, the cluster assignments, and several sums of squares.


fed_km <- kmeans(fed_matrix, 
                 centers = 3)

Optional Arguments

nstart – the number of random starting sets to try (the best result is kept); defaults to 1

algorithm – allows you to choose which K-means algorithm you would like to use (Lloyd, MacQueen, or Hartigan-Wong)
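Because the result depends on where the initial centroids land, it is common to set nstart well above 1. A sketch on made-up two-blob data (the slides’ fed_matrix would work the same way):

```r
# Try many random starts and keep the best (lowest total within-SS) fit.
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))

toy_km <- kmeans(toy,
                 centers   = 2,
                 nstart    = 25,        # 25 random starts; the best is kept
                 algorithm = "Lloyd")   # or "MacQueen", "Hartigan-Wong" (default)

toy_km$tot.withinss   # total within-cluster SS of the best start
```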

Centers

fed_km$centers
        a        do       is       or     this      all       down      it
1 0.01133 0.0005557 0.006617 0.011255 0.004472 0.002808 0.00000000 0.01659
2 0.01991 0.0003927 0.011035 0.006990 0.007490 0.003988 0.00020684 0.01388
3 0.02264 0.0004926 0.011389 0.006299 0.007694 0.003692 0.00006095 0.01290
        our      to      also      even      its    shall         up       an
1 0.0044983 0.03462 0.0013692 0.0005221 0.002432 0.001175 0.00000000 0.002015
2 0.0007316 0.03991 0.0004887 0.0008273 0.003761 0.001594 0.00006386 0.005143
3 0.0020956 0.03882 0.0003545 0.0009570 0.003617 0.001507 0.00037792 0.005632
      every      may   should      upon     and     for.     more       so
1 0.0004988 0.003866 0.002959 0.0001198 0.04885 0.006796 0.005968 0.003247
2 0.0015005 0.004252 0.002164 0.0024354 0.02481 0.006119 0.002760 0.001935
3 0.0018063 0.004356 0.001963 0.0027342 0.02559 0.006711 0.002886 0.002195
       was      any     from     must      some     were      are      had
1 0.001699 0.002560 0.006205 0.001444 0.0015569 0.001991 0.005777 0.001105
2 0.001450 0.002925 0.004930 0.001991 0.0009518 0.001211 0.004672 0.001243
3 0.001539 0.003022 0.005779 0.002555 0.0014136 0.001507 0.005327 0.001454
         my     such     what       as      has       no     than      when
1 0.0003618 0.003601 0.001148 0.011498 0.001965 0.001162 0.004317 0.0016872
2 0.0001322 0.002073 0.001191 0.008948 0.002657 0.002629 0.002846 0.0008404
3 0.0003375 0.002146 0.001287 0.008645 0.003715 0.002593 0.003044 0.0010076
        at     have      not    that    which      be       her       now
1 0.002607 0.005895 0.007625 0.01751 0.006758 0.01880 0.0010409 0.0004457
2 0.002745 0.006013 0.006159 0.01470 0.010791 0.02168 0.0001255 0.0004326
3 0.003359 0.007324 0.006319 0.01493 0.011130 0.01995 0.0008013 0.0004449
      the      who     been       his      of    their     will      but
1 0.06104 0.003742 0.001825 0.0006076 0.04376 0.009810 0.008446 0.004781
2 0.10641 0.001867 0.003792 0.0017050 0.06602 0.004578 0.006730 0.003198
3 0.08707 0.002502 0.004712 0.0021895 0.06232 0.005920 0.006149 0.003631
       if.       on      then     with       by     in.      one    there
1 0.004054 0.005086 0.0005555 0.007253 0.009479 0.01921 0.005726 0.001024
2 0.003313 0.004203 0.0004346 0.005784 0.007441 0.02323 0.002792 0.002621
3 0.003052 0.004274 0.0003348 0.005290 0.008641 0.02383 0.002618 0.002865
     would      can     into     only     things      your
1 0.009061 0.002233 0.003068 0.003100 0.00008643 0.0004461
2 0.007842 0.002768 0.001345 0.001586 0.00018331 0.0000000
3 0.007405 0.002344 0.001717 0.001319 0.00024624 0.0002759

Sums of Squares

Total Sum of Squares

fed_km$totss
[1] 0.0512

Within Sum of Squares

fed_km$withinss
[1] 0.002869 0.013263 0.017125

Between Sum of Squares

fed_km$betweenss
[1] 0.01794
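These three quantities fit together: the total SS equals the between SS plus the sum of the within SS (above, 0.01794 + 0.002869 + 0.013263 + 0.017125 ≈ 0.0512). A sketch on made-up data (fed_km obeys the same identity):

```r
# totss = betweenss + sum(withinss), for any kmeans() fit.
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))
toy_km <- kmeans(toy, centers = 2, nstart = 10)

toy_km$totss - (toy_km$betweenss + sum(toy_km$withinss))   # essentially 0

# The ratio between / total is the share of variation explained by clustering:
toy_km$betweenss / toy_km$totss
```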

Cluster Assignments

res <- tibble(
  clust = fed_km$cluster, 
  auth = fed_known$Author)

res
# A tibble: 70 × 2
   clust auth 
   <int> <chr>
 1     3 AH   
 2     1 JJ   
 3     1 JJ   
 4     1 JJ   
 5     1 JJ   
 6     3 AH   
 7     3 AH   
 8     3 AH   
 9     3 AH   
10     3 JM   
# ℹ 60 more rows
res %>% 
  count(clust, auth)
# A tibble: 5 × 3
  clust auth      n
  <int> <chr> <int>
1     1 JJ        5
2     2 AH       20
3     2 JM        6
4     3 AH       31
5     3 JM        8

K-means + PCA

Did we really need all 200 variables to find those clusters?

Did we maybe “muddy the waters” by weighting all variables equally?


It is very common to do a PCA reduction before running K-means!

Integrating PCA with K-means

Pre-processing

pc <- prcomp(fed_matrix, 
             center = TRUE, 
             scale = TRUE)

fed_reduced <- pc$x[, 1:2] 

Fitting

fed_pca_km <- kmeans(fed_reduced, 
                     centers = 3)

Inspecting

res3 <- tibble(
  clust = fed_pca_km$cluster, 
  auth = fed_known$Author)

res3 %>% 
  count(clust, auth)
# A tibble: 5 × 3
  clust auth      n
  <int> <chr> <int>
1     1 AH       43
2     2 AH        8
3     2 JJ        1
4     2 JM       14
5     3 JJ        4

What if we’d done four centroids?

Three Centroids

# A tibble: 5 × 3
  clust auth      n
  <int> <chr> <int>
1     1 AH       43
2     2 AH        8
3     2 JJ        1
4     2 JM       14
5     3 JJ        4

Four Centroids

# A tibble: 8 × 3
  clust auth      n
  <int> <chr> <int>
1     1 AH       22
2     1 JM        2
3     2 AH        3
4     2 JJ        1
5     2 JM        9
6     3 AH       26
7     3 JM        3
8     4 JJ        4

K-means Takeaways

Pros:

  • Simple algorithm, easy to understand

  • Plays nice with PCA

  • SUPER fast to compute

Cons:

  • Very sensitive to location of initial centroids

  • User has to pick the number of clusters
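One common heuristic for the “pick the number of clusters” problem is an elbow plot: fit k-means for a range of k and look for where the total within-SS stops dropping sharply. Sketched here on made-up two-blob data:

```r
# Elbow plot: total within-cluster SS as a function of k.
set.seed(1)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))

wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

plot(1:6, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster SS")
```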

Try it!

Open `Activity-Clustering.Rmd`

  1. Apply k-means clustering to the cannabis data using all the word predictors.
  2. What was the within and between sum of squares?
  3. Did the clusters match up with the Type?

Now, refer back to your PCA analysis of the cannabis data (from Monday).

  1. Apply k-means clustering to the second and third PCs only.
  2. Plot these clusters. What do you think they capture?

Hierarchical Clustering

(also called agglomerative clustering)

Idea: Merge observations that are close together.

  1. Find the closest two observations. Replace them with their centroid.
  2. Find the next two closest observations. (One of them might be a centroid from an earlier merge!) Replace them with their centroid.
  3. Continue until all observations have been merged.
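The first merge step can be traced by hand on three made-up points. (Note that hclust() supports several merging rules via its method argument; its default, “complete”, merges based on the farthest pair rather than centroids.)

```r
# Three made-up 2-D points:
pts <- rbind(a = c(0, 0), b = c(0, 1), c = c(5, 5))

dist(pts)   # pairwise distances: a-b (= 1) is the smallest pair

# So a and b merge first, and are replaced by their centroid:
colMeans(pts[c("a", "b"), ])   # (0, 0.5)
```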

Merging Observations

Hierarchical Clustering in R

The hclust() function needs to be given a distance matrix.


fed_hc <- fed_matrix %>% 
  hclust()
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536"): missing value where TRUE/FALSE needed

Hierarchical Clustering in R

When you give hclust() a distance matrix…

fed_hc <- fed_matrix %>% 
  dist() %>% 
  hclust()

it gives back a (not super pretty) dendrogram.

plot(fed_hc, 
     labels = fed_known$Author)

Choosing Clusters from Dendrograms

To decide how to assign clusters, we can choose how many clusters (k) we want, and use cutree() to cut the dendrogram into that many clusters.

res_hc <- cutree(fed_hc, 
                 k = 3)

res_hc
 [1] 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1
[39] 1 3 3 3 3 3 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
tibble(
  clust = res_hc,
  auth = fed_known$Author
) %>%
  count(clust, auth)
# A tibble: 4 × 3
  clust auth      n
  <int> <chr> <int>
1     1 AH       51
2     1 JM        8
3     2 JJ        5
4     3 JM        6

Option 2

Alternatively, we can choose a height cutoff (h) at which to cut the dendrogram.

res_hc_2 <- cutree(fed_hc, 
                   h = 0.05)
res_hc_2
 [1] 1 2 2 2 2 1 1 1 1 1 1 1 1 1 1 3 3 3 1 3 1 1 1 3 3 3 1 1 3 1 1 1 1 1 1 4 1 1
[39] 3 4 4 4 4 4 1 1 1 1 2 3 3 3 1 3 3 1 1 1 3 3 1 3 1 1 3 1 3 1 3 3
tibble(
  clust = res_hc_2,
  auth = fed_known$Author
) %>%
  count(clust, auth)
# A tibble: 6 × 3
  clust auth      n
  <int> <chr> <int>
1     1 AH       31
2     1 JM        7
3     2 JJ        5
4     3 AH       20
5     3 JM        1
6     4 JM        6

Hierarchical Clustering Takeaways

Pros:

  • Fast computation for moderate-sized data

  • Gives back full information in dendrogram form

Cons:

  • User has to decide how to go from dendrogram to cluster assignments

Try it!

  1. Apply hierarchical clustering to the cannabis data
  2. Compare your results to k-means.
  3. Which method do you prefer? Why?